Individualized Testing and Item Characteristic Curve Theory*

Abstract 



An elementary survey of item characteristic curve theory is presented,
centered around the problems of individualized ("tailored") testing.

1. Introduction 

In conventional mental testing situations, a group of individuals 
take the same test. Inevitably, an aptitude or achievement test is too 
easy for some individuals and too hard for others. Some may obtain perfect
or near perfect scores on the test; others may score near zero. If
successful random guessing is possible, low scores will be at and below 
the chance level. 

If a test is too easy for some individuals, it will not discriminate
effectively among them. A helpful analogy for this situation is a high-jump
contest: one would not try to rank the best jumpers by always setting
the bar at a level appropriate for mediocre jumpers. Similarly, if a
test is too hard for some individuals, it will not discriminate effectively
among them either. One would not try to rank poor jumpers by setting the
bar at a level where none of them clear it.

If successful random guessing is possible (as on almost all objective
tests), it is also obvious that the test cannot effectively measure an
individual who gives random answers to almost all the test questions.
The "noise" on his answer sheet overwhelms the "signal." This discussion
suggests that for each individual there is an optimal difficulty level at
which test questions are most effective for evaluating his performance or
"ability."

Let us limit further consideration to the common case where all
responses to test questions are (treated as) either "right" or "wrong."
If there is no guessing, a common rule for effectively measuring performance
(this is also the rule to which theory will lead us) calls for a difficulty
level such that the individual will answer half the questions correctly and
half incorrectly. If questions can be answered correctly by blind guessing,
then the optimal difficulty level will be somewhat easier than this.

* Preparation of this chapter was supported in part by Grant GB-32781X
from the National Science Foundation.

Clearly it would be desirable to test each individual with questions
best suited to his ability level. This is likely to be impractical in
ordinary paper-and-pencil testing situations (but see Lord, 1971a, b, c).
Now that many educational institutions have high-speed computers, however,
it is becoming practical to have the computer "tailor" the test for each
individual tested, administering only test questions that seem appropriate
for his level of ability, as judged from his responses to the questions
previously administered.

In order to tailor the test to the individual tested, the computer
must be able

1. To predict from the individual's previous responses how he
would respond to various questions not yet administered
(these may be more, or less, difficult than any of the questions
already administered).

2. To make effective use of this knowledge in picking the question
to be administered next.

3. To assign at the end of the testing a numerical score (or
interval estimate) somehow representing the "ability" or
overall level of performance of the individual tested.

2. Test Theory for Itemized Tests

Classical test theory does not provide an appropriate framework
for dealing with any of these three tasks that the computer (or the
tailored-test designer) must carry out. Classical test theory is of great
practical value in the design, construction, pretesting, scoring, statistical
analysis, and interpretation of conventional tests of all kinds. An
effective theory for similar purposes is urgently needed for individualized
testing. Without careful design and appropriate scoring, individualized
testing will often be inferior to conventional testing.

If we are to think meaningfully about "good" testing procedures and
"inferior" procedures, we first need to be clear about the purpose of
testing. The immediate purpose is not simply to determine the individual's
actual performance on the particular test questions administered. This
statement becomes obvious in individualized testing, since here each
individual is responding to a different set of test questions, so that no
comparisons among individuals are possible in terms of actual performance.
Rather, the purpose is to make some inference as to his typical or expected
performance on a large class of questions like those administered. In
order to have a convenient label, this typical or expected performance
will be called the ability of the individual in the area represented by
the class of test questions.

If the questions in a class are too heterogeneous, "ability" as 
defined above has little psychological meaning. Science and understanding 
will best be served if we choose to work (at least initially) with classes
of questions sufficiently homogeneous so that we are happy to describe
performance on any one class by a single number, rather than by several. 
This grouping of questions into homogeneous classes will be assumed in 
all that follows (however, see Mulaik, 1972, for a model that avoids this 
assumption). The reader may think of certain spelling tests, vocabulary 
tests, or tests of spatial abilities, among others, as providing good 
practical examples of homogeneous grouping of questions. 

Note: There is no suggestion that an ability as defined here is
in any sense a genetic, anatomical, neurological, or even psychological
entity. For example, an "ability" useful in one set of circumstances as 
a dimension for describing individuals might in other circumstances be 
shown to be a composite of several abilities. 

Our main problem is to infer the individual's ability (in the area 
represented by the test) from his performance on certain test questions. 
In order to do this, it is indispensable to have some idea of how the 
individual's responses depend upon his ability. 

3. The Guttman Scale

A simple and appealing model has often been used in the attempt to 
describe the dependence of examinee response on examinee ability. The 
test questions are visualized as hurdles, the height of the hurdle being 
directly related to the difficulty of the question. The ability of the 
examinee completely determines which hurdles he can clear and which he 
cannot. In this deterministic model, all questions below a certain
difficulty level are answered correctly by a given examinee; all questions
above this level are answered incorrectly. A scale of test questions
displaying this property for all examinees is called a Guttman scale
(see Torgerson, 1958, Chapt. 12).

This model is arrived at by asking what we would like a test to do.
It would be nice if we could know from the examinee's test score (number
of right answers) exactly how he responded to every question in the test.
This knowledge can be obtained from a Guttman scale but not from any
other kind of test.

Although approximate Guttman scales are of use in sociological 
work and in attitude measurement, they seem to be of little interest 
in aptitude and achievement testing. For one thing, in many common 
situations an ideal aptitude or achievement test should have all items 
of equal difficulty. According to the deterministic hurdle model, all 
examinees should obtain either a zero score or a perfect score on such 
a test. Nothing like this happens in practice, however. The distribution 
of number-right scores is typically bell-shaped, even when we try by 
every means to obtain a U-shaped distribution. 

The Guttman scale assumes that the tetrachoric correlation between
scores on any two test questions is 1.00. For two questions of medium
difficulty, this would mean a product-moment correlation of approximately
1.00 also. Actually, the tetrachoric correlation between typical
aptitude or achievement test questions is not 1.00 but only about .15.
The product-moment correlation between questions of medium difficulty
is about .10.

4. Item Characteristic Curve Theory

If we want a mathematical model capable of fitting typical aptitude
or achievement test data, we must use a probabilistic rather than a
deterministic model. Denote the probability that individual a will
answer a test question correctly by
$P_{\boldsymbol{\beta}}(\theta_a) = \mathrm{Prob}(U_a = 1)$. Here
$U_a$ is a random variable that assumes the value 1 when individual a
answers correctly, 0 otherwise. The real number $\theta_a$ represents the
ability of individual a. The vector $\boldsymbol{\beta}$ contains parameters fully
characterizing the test question ("item") administered. The difficulty
of the item, for example, will be represented by one of the parameters in
$\boldsymbol{\beta}$. All this notation serves only to assert that the probability that
individual a will answer a question correctly depends only upon the
ability of the individual and upon certain characteristics of the test
question.

The probability $P_{\boldsymbol{\beta}}(\theta_a)$ is to be interpreted here (see next section)
as a relative frequency over randomly selected test questions all having
the same characteristics $\boldsymbol{\beta} = \boldsymbol{\beta}_i$. There is no consideration here of
repeated testing; each individual is tested only once.

It is natural to assume that $P_{\boldsymbol{\beta}}(\theta)$ is an increasing function of $\theta$.
The higher the ability level, the greater the probability of a correct
answer. This will be assumed hereafter. Some typical functions $P_{\boldsymbol{\beta}}(\theta)$
(see Lord, 1968) are shown in Figure 1 for illustrative purposes.

We wish to assume that the probability of a correct answer to a question
depends only on the individual's ability level and on $\boldsymbol{\beta}$, not on any
other known characteristic of his, nor on any other characteristics of
the question, nor on any other variable available to us. It follows
that the probability of an individual answering correctly is not
altered by knowledge of the actual performance of other individuals.
Thus, for example, the probability of correct answers to a question by
both individuals a and a' is given by the product
$P_{\boldsymbol{\beta}}(\theta_a)\,P_{\boldsymbol{\beta}}(\theta_{a'})$.

In addition, it follows that the probability of an individual
answering a question correctly is not altered by knowledge of his actual
performance on other questions. Thus, for example, the probability that
individual a will answer questions i, i', and i" all correctly
is given by the product
$P_{\boldsymbol{\beta}_i}(\theta_a)\,P_{\boldsymbol{\beta}_{i'}}(\theta_a)\,P_{\boldsymbol{\beta}_{i''}}(\theta_a)$.
This is called the principle of local independence (Lazarsfeld, 1959).
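
As a small numerical illustration of local independence, the sketch below multiplies item success probabilities for a single examinee. The logistic form of the curve, the helper names `icc` and `joint_prob_all_correct`, and the item parameter values are all assumptions made for illustration; the text at this point leaves the form of the curve unspecified.

```python
import math

def icc(theta, alpha, beta, gamma=0.0):
    """Illustrative three-parameter logistic curve (an assumed form)."""
    return gamma + (1.0 - gamma) / (1.0 + math.exp(-1.7 * alpha * (theta - beta)))

def joint_prob_all_correct(theta, items):
    """Probability of answering every item correctly under local independence."""
    prob = 1.0
    for alpha, beta, gamma in items:
        prob *= icc(theta, alpha, beta, gamma)
    return prob

# Hypothetical item parameters (alpha, beta, gamma) for items i, i', i".
items = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.2)]
p_all = joint_prob_all_correct(0.0, items)
```

If observed joint frequencies at a fixed ability level were systematically larger than such a product, that would be a sign that the items measure more than one dimension.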

It is instructive to see what would happen if local independence did
not hold. Suppose that for a certain individual a the probability that
he will answer randomly chosen questions i, i', and i" all correctly
is greater than
$P_{\boldsymbol{\beta}_i}(\theta_a)\,P_{\boldsymbol{\beta}_{i'}}(\theta_a)\,P_{\boldsymbol{\beta}_{i''}}(\theta_a)$.
If this is not a unique occurrence, this would mean that there are
individuals at ability level $\theta = \theta_a$ who
score systematically higher on these test questions than other individuals
with the same $\theta$ level. Thus these test questions would be measuring some
psychological dimension other than $\theta$. This is just the situation that
the assumption of local independence is designed to exclude. We want to
deal with a test that measures the ability $\theta$; we do not want to deal
(at least at first) with a test score that may represent either of two
(or more) psychological dimensions at once.

5. An Alternative Model

It seems necessary at this point to mention another model very commonly
confused with the $P_{\boldsymbol{\beta}}(\theta_a)$ model used here. This other model makes
assertions about the probability, to be denoted by $P_i(\theta_a)$, that a specific
individual a answers a specific item i correctly. The two models will
be distinguished and reasons given for discarding one of them.

If an individual responds to question i at random, it is clear that 
his probability of success is the reciprocal of the number of possible 
responses to question i. There are many questions, however, for which
this individual knows the correct answer; for such a question, his 
probability of answering correctly would seem to be virtually 1. There 
may be other questions on which this individual is misinformed; for such 
a question, his probability of answering correctly would seem to be 
virtually 0. 

Consider two individuals, a and b, and two test questions, i and
j. Individual a happens to know the answer to question i and to be
misinformed on question j. Individual b happens to know the answer to
question j and to be misinformed on question i. If we write $P_i(\theta_a)$
for the probability that individual a answers question i correctly,
we have $P_i(\theta_a) = 1$, $P_j(\theta_a) = 0$, $P_i(\theta_b) = 0$, $P_j(\theta_b) = 1$,
approximately. The first two equations considered together imply that question
i is easier than question j; the last two equations imply just the
reverse. Thus questions i and j must measure a different ability for
individual a than they measure for individual b. This is a possible
model (Meredith, 19 ) and a possible interpretation, but usually not a
fruitful one, since usually we want to compare individuals a and b
along the same ability dimension.

In order to avoid the situation just outlined, we will use only
the model defined in the previous section, which makes no assertions about
the probability $P_i(\theta_a)$ that a specified individual a answers a specified
test question i correctly. The model that we will use deals instead with
$P_{\boldsymbol{\beta}}(\theta_a)$, which represents the long-run relative frequency of correct answers
given by individual a when answering test questions all having the same
specified $\boldsymbol{\beta}$. An equivalent statement is that $P_{\boldsymbol{\beta}}(\theta_a)$ represents the
probability that individual a will answer correctly a question chosen
at random from all questions having the same $\boldsymbol{\beta}$. When the model holds,
the function $P_{\boldsymbol{\beta}}(\theta)$ of $\theta$ will be referred to as the characteristic
curve for each item having parameters $\boldsymbol{\beta}$.

6. Specialization, Application, and Evaluation 

Empirical checks on the validity and practical utility of the item 
characteristic curve (icc) model have in large part been delayed for about 
twenty-five years because of the difficulty of estimating the characteristic 
curves of particular items. Recently a number of workers have successfully 
estimated many icc and some evidence of the validity and usefulness of the 
model has been accumulated. The present section is intended to refer the 
reader to materials relevant for assessing the validity and usefulness 
of the model; no detailed discussion is possible here. 

An approach that estimates icc without restrictive assumptions about
their mathematical form has been described by Lord (1970a). If it can be
assumed simply that the icc differ only by a linear transformation of $\theta$
(a common assumption), a computer program implementing Levine (1972) has
been found very effective for estimating icc (Levine, personal communication).

It has been common to assume that the icc are normal ogives or logistic
curves. If the icc are logistic and if, for a given test, the curves all
have the same slope parameter, the present model can be shown (Birnbaum,
1968, p. 402) to be the same as the well-known Rasch model, which has
certain desirable measurement properties (Rasch, 1960, 1961, 1966a, b;
Wright, 1968; Wright and Panchapakesan, 1969). Methods for estimating the
single item parameter needed in this model and studies evaluating the fit
and effectiveness of this model have been reported by Rasch, by Wright and
Panchapakesan, and by Lawley (1943, 1944), Andersen (1970; 1971a, b; 1972a, b),
Anderson, Kearney, and Everett (1968), Choppin (1968), Fischer (1972),
Fischer and Scheiblechner (1970), Hambleton (1969), Hambleton and Traub
(1971), Panchapakesan (1969), Scheiblechner (1971a, b), Tinsley and Dawis
(1972), Urry (1970). Reports on the fit and effectiveness of the one-parameter
model range from disapproval to enthusiasm.

If some test questions correlate higher with ability than others, as
is commonly the case, a one-parameter model may be inadequate. Whenever
correct answers can be obtained by random guessing, even a two-parameter
model is likely to be inadequate. Modified normal ogive and logistic
models with three parameters are available (Birnbaum, 1968, chapter 17).
The mathematical formulas are

$$P_{\boldsymbol{\beta}}(\theta) = \gamma + (1 - \gamma)\,\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\alpha(\theta - \beta)} \exp\left(-\tfrac{1}{2}t^{2}\right)\,dt \qquad (1)$$

for the modified normal ogive, and

$$P_{\boldsymbol{\beta}}(\theta) = \gamma + \frac{1 - \gamma}{1 + \exp[-1.7\,\alpha(\theta - \beta)]} \qquad (2)$$

for the modified logistic. Models (1) and (2) do not differ anywhere by more
than .01. We need not debate here which model is more nearly correct.
Neither, of course, is exactly correct.
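
The closeness of the two models can be checked numerically. The sketch below assumes the customary forms: a normal ogive evaluated through `math.erf`, and a logistic curve carrying the scaling constant 1.7. The function names, the grid limits, and the parameter values are illustrative assumptions.

```python
import math

def normal_ogive(theta, alpha, beta, gamma):
    """Modified normal ogive: gamma + (1 - gamma) * Phi(alpha * (theta - beta))."""
    z = alpha * (theta - beta)
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return gamma + (1.0 - gamma) * phi

def logistic(theta, alpha, beta, gamma, scale=1.7):
    """Modified logistic, assuming the customary 1.7 scaling constant."""
    z = scale * alpha * (theta - beta)
    return gamma + (1.0 - gamma) / (1.0 + math.exp(-z))

def max_abs_difference(alpha, beta, gamma, lo=-6.0, hi=6.0, steps=12000):
    """Largest gap between the two curves on an evenly spaced theta grid."""
    worst = 0.0
    for k in range(steps + 1):
        theta = lo + (hi - lo) * k / steps
        gap = abs(normal_ogive(theta, alpha, beta, gamma)
                  - logistic(theta, alpha, beta, gamma))
        worst = max(worst, gap)
    return worst

gap_no_guessing = max_abs_difference(alpha=1.0, beta=0.0, gamma=0.0)
gap_with_guessing = max_abs_difference(alpha=1.0, beta=0.0, gamma=0.2)
```

Because both curves share the factor (1 - gamma), a nonzero guessing parameter can only shrink the gap between them.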

The three parameters in $\boldsymbol{\beta} = \{\alpha, \beta, \gamma\}$ may be thought of as

$\alpha$  discriminating power (a measure of the relation between
item score and ability),

$\beta$  difficulty,

$\gamma$  probability of a correct answer for individuals at lowest
ability levels.

A more detailed, practical discussion of these item parameters is given in 
Lord (1970b). 

Lord and Novick (1968, section 16.11) consider for what practical
situations the normal ogive model is likely to be appropriate. Studies 
evaluating the fit of the model to actual test data include Lord (1952, 1970a, 
1972); Indow and Samejima (1962, 1966). Their findings support the model 
for the data studied. Many more evaluative studies are needed. 

Methods for estimating the item parameters have been developed and tried
out by Lord (1952, 1968, 1972), Indow and Samejima (1962, 1966), Birnbaum (1968),
Bock (1970, 1972), Bock and Lieberman (1970), Kolakowski and Bock (1970),
Kolakowski (1969, 1972), Lees, Wingersky, and Lord (1972). Studies making
theoretical or practical use of these models appear in two books by Solomon
(1961, 1965). Included among other such studies are those by Brogden (1946),
Tucker (1946), Cronbach and Warrington (1952), Lord (1953a, b; 1955; 1970a,
b; 1971a, b, c, d, e), Cronbach and Merwin (1955), Anderson (1959), Merwin
(1959), Cronbach and Azuma (1962), Paterson (1962), Birnbaum (1968, chapters
17-20; 1969), Wood and Skurnik (1969), Shiba (1969a, b), Shoemaker and
Osburn (1970), Urry (1970), Nishisato and Torii (1971), Bay (1971),
Hambleton and Traub (1971), Samejima (1972).

7. Pretesting 

In order to design a test for the specific purpose of measuring the
ability of a particular individual, we must have available a large pool of
test questions that have been extensively pretested, so that the parameters
characterizing each question may be considered known. Note that
the item characteristic function $P_{\boldsymbol{\beta}}(\theta)$ does not depend on the
distribution of ability in any group of individuals. Consequently, the parameters
$\boldsymbol{\beta}$ for a test question can be determined once and for all by pretesting in
some convenient group. Of course, reliance on the robustness of the model
over wide variations in group should not be carried to extremes. In
practice, the pretest group should resemble the collection of individuals
who will later be given the individualized tests.

Corresponding to the invariance of the item parameters $\boldsymbol{\beta}$ over groups
of individuals there is an invariance of the ability parameter $\theta$ over
different tests (cf. Rasch, 1961, pp. 331-333). These invariances are
fundamental to the success of item characteristic curve theory in comparison
with older item analysis methods. In a leading older method,
each item would be characterized by the proportion of correct answers
received and by the correlation between item response and total test
score. However, these item parameters of the older method would be different
for different pretest groups; also, the correlation parameter would
change if the test was lengthened or otherwise modified. This lack of
invariance limits the usefulness of classical item analysis. The usual
kinds of test score for an individual have a similar lack of invariance
when the test administered is modified.



8. The Statistical Estimation of Ability 



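Once the item parameters $\boldsymbol{\beta}$ have been determined by pretesting, the
problem of estimating the ability of an individual from his responses is a
straightforward statistical estimation problem. If his probability of
success on question i is $P_{\boldsymbol{\beta}_i}(\theta)$ and his probability of failure is
$Q_{\boldsymbol{\beta}_i}(\theta) \equiv 1 - P_{\boldsymbol{\beta}_i}(\theta)$, then the likelihood function for his score
($u_i = 1$ or 0) on question i is simply

$$L_i(\theta) = \begin{cases} P_{\boldsymbol{\beta}_i}(\theta) & \text{if } u_i = 1, \\ Q_{\boldsymbol{\beta}_i}(\theta) & \text{if } u_i = 0. \end{cases}$$

This may be more conveniently written

$$L_i(\theta) = [P_{\boldsymbol{\beta}_i}(\theta)]^{u_i}\,[Q_{\boldsymbol{\beta}_i}(\theta)]^{1-u_i}.$$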

Because of local independence, the likelihood function for the
individual's responses to a test of n questions is simply the product
of the likelihoods for the separate questions:

$$L(\theta) = \prod_{i=1}^{n} [P_{\boldsymbol{\beta}_i}(\theta)]^{u_i}\,[Q_{\boldsymbol{\beta}_i}(\theta)]^{1-u_i}.$$

Since the $\boldsymbol{\beta}_i$ are known from pretesting, it is not difficult for a
computer, given a mathematical form for $P_{\boldsymbol{\beta}}(\theta)$ such as (1) or (2), to find
the maximum likelihood estimate $\hat{\theta}$ of the individual's ability. ($\hat{\theta}$
is the value of $\theta$ that maximizes the likelihood $L(\theta)$ of his observed
responses $u_1, u_2, \ldots, u_n$.)
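
A minimal sketch of this estimation step follows, assuming a three-parameter logistic form for the item curve and a plain grid search in place of a more refined maximizer. The names `icc`, `log_likelihood`, and `ml_ability`, the grid limits, and the item values are illustrative assumptions, not a procedure taken from the text.

```python
import math

def icc(theta, alpha, beta, gamma):
    """Three-parameter logistic curve, assumed here as the form of P."""
    return gamma + (1.0 - gamma) / (1.0 + math.exp(-1.7 * alpha * (theta - beta)))

def log_likelihood(theta, items, responses):
    """Sum of log L_i(theta) over items, which local independence permits."""
    total = 0.0
    for (alpha, beta, gamma), u in zip(items, responses):
        p = icc(theta, alpha, beta, gamma)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def ml_ability(items, responses, lo=-4.0, hi=4.0, steps=800):
    """Grid-search maximum likelihood estimate of theta on [lo, hi]."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical pretested items (alpha, beta, gamma) and observed responses.
items = [(1.0, -1.0, 0.0), (1.0, 1.0, 0.0)]
responses = [1, 0]
theta_hat = ml_ability(items, responses)
```

When every response is correct, or every response is wrong, the likelihood has no interior maximum and the grid search simply returns an endpoint of the search interval.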

9. A Simpler Procedure for Estimating Ability 

There still remains the problem of how to pick the n test questions
to be administered to a given individual. One advantage of individualized
testing is that testing can be continued until the individual's ability has
been estimated with some predetermined degree of statistical accuracy. For
the sake of simplicity, however, we will consider here only the case where
n is fixed.

To make matters even more simple, let us select from the pool a
large set of pretested questions that differ from each other only in
difficulty ($\beta_i$). These questions have identical values of $\alpha$ and $\gamma$.
If (1) or (2) held with no random guessing, the optimal test for estimating
$\theta$ with minimum squared error would consist entirely of questions for which
$P_{\boldsymbol{\beta}}(\theta_a) = \tfrac{1}{2}$, where $\theta_a$ is the ability of the individual to be tested.
Since we do not know $\theta_a$ and cannot estimate it with any accuracy in
advance of testing, all this would not give us a method for choosing the
n test questions to be administered. Such methods will be discussed in
the next section.

Let us assume now (as seems reasonable) that individual ability ($\theta$)
and item difficulty ($\beta$) are measured along the same dimension, in the
sense that for any increment k an increase in ability from $\theta$ to $\theta + k$
could hypothetically be exactly offset by an equivalent increase in
difficulty from $\beta$ to $\beta + k$. In other words, $P_{\{\alpha, \beta, \gamma\}}(\theta)$ and
$P_{\{\alpha, \beta + k, \gamma\}}(\theta + k)$ represent exactly the same function of $\theta$. This
assumption holds for models (1) and (2) and for any other $P_{\boldsymbol{\beta}}(\theta)$ in
which $\theta$ and $\beta$ appear only as their difference $\theta - \beta$. Under this
assumption $P_{\boldsymbol{\beta}}(\theta) = F(\theta - \beta)$ where F is an unspecified monotonic
function.
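
The translation property can be verified directly for a curve in which ability and difficulty enter only through their difference. The sketch assumes the three-parameter logistic form; the names `icc` and `shifted_pair` and all numerical values are illustrative assumptions.

```python
import math

def icc(theta, alpha, beta, gamma):
    """Three-parameter logistic curve; theta and beta enter only as theta - beta."""
    return gamma + (1.0 - gamma) / (1.0 + math.exp(-1.7 * alpha * (theta - beta)))

def shifted_pair(theta, alpha, beta, gamma, k):
    """Return P at (theta, beta) and P at (theta + k, beta + k)."""
    return icc(theta, alpha, beta, gamma), icc(theta + k, alpha, beta + k, gamma)

original, shifted = shifted_pair(theta=0.3, alpha=1.1, beta=-0.4, gamma=0.2, k=2.5)

# F(0): the success probability when item difficulty equals ability.
f_at_zero = icc(0.0, 1.1, 0.0, 0.2)
```

For this assumed form, F(0) equals gamma + (1 - gamma)/2, which reduces to 1/2 only when there is no guessing.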

What we have assumed here is simply that we have a large set of questions,
selected from the pretested pool, whose icc differ only by a translation
along the $\theta$ axis. Let us define $\beta^0$ as the item difficulty level
at which the individual has probability of success $F(0)$. Thus $\theta = \beta^0$.
We can determine an individual's ability $\theta$ by determining his $\beta^0$.

It is possible in practice to find the proportion of correct answers
actually given by individual a to test questions at any specified
difficulty level. By trial and error, or by better methods to be discussed
below, we can in this way find approximately the difficulty level $\beta_a^0$
such that $P_{\{\alpha, \beta_a^0, \gamma\}}(\theta_a) = F(0)$. This difficulty level is (approximately)
the ability level of individual a, since by definition $\theta_a = \beta_a^0$.

10. Stochastic Approximation 

Clearly what we need now is some method better than trial and error
for finding $\beta_a^0$. Stated in this way, the problem is a standard problem
in stochastic approximation (Wasan, 1969). Specifically, the stochastic
approximation problem is to select a sequence of test questions so that
we can conveniently and accurately estimate the individual's ability $\theta_a$
from the sequence $u_1, u_2, \ldots, u_n$ of responses. Since for simplicity we
have selected from the pretest pool a set of test questions that differ
statistically only on their difficulty parameter (for a treatment that
avoids this, see Owen, 1970), the problem of selecting a sequence of test
questions is simply the problem of selecting a sequence
$\beta_1, \beta_2, \ldots, \beta_n$.
The resulting sequence of questions constitutes an individualized test
or tailored test designed for effective measurement of the particular
individual tested.

The difficulty $\beta_1$ of the first question administered can be chosen
in the same way that we would choose the average difficulty level of the
questions in a conventional test: by subjective judgment or by using a
Bayesian prior. If the individual answers the first question incorrectly,
we guess that it is too hard for him and choose an easier question to
administer next. If he answers the first question correctly, we guess that
it is too easy for him and choose a harder question to administer next.

After administering the second question, we could use the statistical
method outlined in section 8 to obtain from his responses to the first
two questions an estimate $\hat{\theta}_a^{(2)}$ of the individual's ability. The
difficulty of the third question administered could be matched to the
individual's estimated ability by choosing $\beta_3 = \hat{\theta}_a^{(2)}$. We could then
choose $\beta_4, \beta_5, \ldots$ similarly. However, a procedure that proceeds by
steps that are individually optimal is not in this case likely to be
an optimal procedure overall.

We will not try here to devise an optimal procedure. Rather, we will
try to find a good, simple procedure that is not only easy to carry out
but also easy to evaluate as a procedure for statistical inference.

Under the Robbins-Monro stochastic approximation procedure, the rule
for choosing the difficulty of the (v + 1)-st question is

$$\beta_{v+1} = \beta_v + d_v\,[u_v - F(0)], \qquad (3)$$

where $d_1, d_2, \ldots$ is a suitable decreasing sequence of positive numbers
chosen in advance (Robbins and Monro, 1951). If the step size $d_v$ is small,
the (v + 1)-st question will be chosen to have nearly the same difficulty
as the v-th question; if $d_v$ is large, there will be a more substantial
change in difficulty. In the Robbins-Monro procedure, the $d_v$ are chosen
relatively large initially when little is known about the individual's
ability level, allowing substantial readjustments in item difficulty levels.
Later when the appropriate difficulty level has been approximated, the
chosen $d_v$ are small, eventually approaching zero. Typically $d_v = d_1/v$,
$v = 1, 2, 3, \ldots$.

Robbins and Monro's proof shows that when (3) is used with suitable
$d_v$, the item difficulty $\beta_{v+1}$ is a consistent estimator of the
individual's ability $\theta$, in the sense that $\beta_{v+1}$ converges stochastically
to $\theta$ as v becomes large. Formulas leading in some cases to asymptotically
optimal choices of the $d_v$ are given by Hodges and Lehmann (1956).
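
The Robbins-Monro rule is easy to trace by hand or by machine. The sketch below applies the update "next difficulty = current difficulty + (d1 / v) * (response - F(0))" to a given response sequence, and also simulates an examinee whose responses follow an assumed three-parameter logistic curve. The function names, the seed, and the parameter values are illustrative assumptions.

```python
import math
import random

def icc(theta, alpha, beta, gamma):
    """Three-parameter logistic curve, assumed here as the form of P."""
    return gamma + (1.0 - gamma) / (1.0 + math.exp(-1.7 * alpha * (theta - beta)))

def robbins_monro_path(beta1, d1, responses, target):
    """Difficulties beta_1 .. beta_{n+1} with step sizes d_v = d1 / v."""
    path = [beta1]
    for v, u in enumerate(responses, start=1):
        path.append(path[-1] + (d1 / v) * (u - target))
    return path

def simulate_examinee(theta, beta1, d1, n, alpha=1.0, gamma=0.0, seed=0):
    """Simulate n responses, choosing each new difficulty by the same rule."""
    rng = random.Random(seed)
    target = icc(0.0, alpha, 0.0, gamma)  # F(0): success rate when beta equals theta
    path = [beta1]
    for v in range(1, n + 1):
        u = 1 if rng.random() < icc(theta, alpha, path[-1], gamma) else 0
        path.append(path[-1] + (d1 / v) * (u - target))
    return path

hand_trace = robbins_monro_path(beta1=0.0, d1=1.0, responses=[1, 0, 1], target=0.5)
simulated = simulate_examinee(theta=1.0, beta1=0.0, d1=2.0, n=50)
```

In the hand trace the difficulty moves up by 1/2, down by 1/4, and up by 1/6, showing how the readjustments shrink as testing proceeds.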

11. The Staircase Method for Selecting the Test Questions 

Unfortunately, the Robbins-Monro procedure requires storing $2^n$
test questions in the computer before testing is begun, where n is the
number (here assumed to be fixed in advance) of questions to be administered
to the examinee. For most aptitude and achievement tests composed
of dichotomously scored questions, n > 23.

An alternative procedure, keeping the total number of test questions
within acceptable limits, is available: the up-and-down method or staircase
method, used in testing explosives, in bioassay, in psychophysics,
and elsewhere. In the up-and-down method, the rule for selecting questions
is still given by (3), but with $d_v$ replaced by a constant step size d.
If $F(0) = 1/2$, the up-and-down rule becomes

$$\beta_{v+1} = \begin{cases} \beta_v + d & \text{if question } v \text{ is answered correctly,} \\ \beta_v - d & \text{if question } v \text{ is answered incorrectly.} \end{cases}$$

This simple form of (3) normally holds only if there is no guessing of
correct answers.

For basic discussions of this method, see Dixon and Mood (1948) and
Brownlee, Hodges and Rosenblatt (1953). Some modifications are discussed
by Tsutakawa (1963, 1967a, b).

This method requires storing only n(n + 1)/2 test questions in
the computer in advance of testing. This number can be reduced further 
by taking a few obvious shortcuts. 
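
The count n(n + 1)/2 can be confirmed by enumerating the difficulty levels reachable at each position in the test. The sketch assumes one stored question per (position, difficulty) pair and measures difficulty as an integer number of steps away from the starting level; the function name is an illustrative assumption.

```python
def up_and_down_pool_size(n):
    """Count distinct (position, difficulty) pairs reachable in n questions."""
    levels = {0}  # offsets from the first difficulty, in units of the step d
    total = 0
    for _ in range(n):
        total += len(levels)
        # Each response moves the difficulty one step up or one step down.
        levels = {k + step for k in levels for step in (1, -1)}
    return total

sizes = {n: up_and_down_pool_size(n) for n in (5, 10, 25)}
```

At position v only v distinct levels are reachable, so the total grows quadratically in n rather than exponentially.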

12. Scoring the Answers* 

Consider the following three simple methods for scoring the 
student's responses to the test questions: 

*This section and part of the previous section are a slight revision
of material appearing in Lord (1971e).

1. The "final-difficulty score," $\beta_{n+1}$, the difficulty of the
(n + 1)-th question (not actually administered) as defined
by equation (3).

2. The "number-right score," $\sum_{v=1}^{n} u_v$, or the "proportion-right
score," $\frac{1}{n}\sum_{v=1}^{n} u_v$. The former is the score most
commonly used in scoring conventional mental tests.

3. The "average-difficulty score," $\bar{\beta} = \frac{1}{n}\sum_{v=2}^{n+1} \beta_v$. This
score is simply the average of the difficulty parameters
of the questions administered, omitting the first
(since the first question is the same for all individuals
tested) and including $\beta_{n+1}$.

[Before going ahead, the reader may wish to make a guess as to the 
relative merits of these three scoring methods for the up-and-down 
(fixed step size) procedure.] 

When the step size shrinks appropriately as n increases, as in the
Robbins-Monro procedure, $\beta_{n+1}$ is a good estimator of ability. When the
step size is fixed, as in the up-and-down method, $\beta_{n+1}$ is no longer a
consistent estimator for $\theta$, nor does its sampling variance approach
zero as n becomes large. It turns out that when step size is fixed,
number-right score is perfectly correlated with $\beta_{n+1}$; so it, too, can
be eliminated as an effective method of scoring.

Brownlee, Hodges, and Rosenblatt (1953) have shown that the
average-difficulty score is asymptotically equivalent to the maximum likelihood
estimator for $\theta$ found by Dixon and Mood (1948) for the up-and-down
method. Although no optimum small-sample properties have been proven for
the average-difficulty score, it appears at present to be the preferable
method of scoring tests administered by the up-and-down method.
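
The three scores can be computed from a single response record. The sketch below assumes the fixed-step rule with moves of plus or minus d and uses made-up responses; the function names are illustrative assumptions.

```python
def up_and_down_path(beta1, d, responses):
    """Difficulties beta_1 .. beta_{n+1} under the fixed-step rule."""
    path = [beta1]
    for u in responses:
        path.append(path[-1] + d if u == 1 else path[-1] - d)
    return path

def scores(beta1, d, responses):
    """Return final-difficulty, number-right, and average-difficulty scores."""
    path = up_and_down_path(beta1, d, responses)
    n = len(responses)
    final_difficulty = path[-1]
    number_right = sum(responses)
    average_difficulty = sum(path[1:]) / n  # averages beta_2 .. beta_{n+1}
    return final_difficulty, number_right, average_difficulty

final, right, average = scores(beta1=0.0, d=0.5, responses=[1, 1, 0, 1])
```

Under this rule the final difficulty equals the first difficulty plus d times (2 * number right - n), which is why the final-difficulty and number-right scores carry the same information, while the average-difficulty score also depends on the order of the responses.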

It frequently happens that similar groups of students are tested 
year after year. In this case, an excellent prior distribution for the 
parameter $\theta$ is available, based on records of past performance. In
such situations, the careful design of a tailored testing procedure 
would certainly be based on a Bayesian approach. The Bayesian approach 
will not be treated here since it is of greater mathematical complexity. 
The interested reader is referred to Owen (1970) and to Freeman (1970). 

13. Evaluation of Testing Methods

The remaining problem for discussion here is the evaluation of different
stochastic approximation procedures and of different choices
of parameters such as d.

Properties of the Robbins-Monro procedure for large n are dis- 
cussed in the references given. Some properties for small n are 
treated by Wasan (1969, chapt. 2) and by Cochran and Davis (1965). An 
improved procedure for small n is suggested by Kesten (1958) and tried 
out empirically by Odell (1961). 

The up-and-down rule for selecting test questions to be administered
produces a Markov chain or, more specifically, a random walk for the values
of $\beta_v$. The transition probabilities $P_v(\theta)$ and $Q_v(\theta)$
are stationary. They depend on $\beta_v$, but they do not depend on v when
$\beta_v$ and $\theta$ are given.

Starting from this, it is not hard to write down a formula for the
frequency distribution of $\beta_{n+1}$ under the up-and-down method; but
$\beta_{n+1}$ is not a satisfactory scoring procedure for this method, as
already noted. The frequency distribution of the average-difficulty
score $\bar{\beta}$ is not easily obtained for moderate n, but Brownlee, Hodges,
and Rosenblatt have provided recursive formulas from which the mean and
sampling variance of $\bar{\beta}$ can be readily calculated numerically by
computer for given $\theta$, $\alpha$, $\beta_1$, $\gamma$, d, $F(0)$ and for any n likely
to be of interest. Given the bias and sampling variance of $\bar{\beta}$ for
given $\theta$ for each of various testing designs, it is not hard to decide
which design is preferable for measuring at a specified ability level.
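
As an illustration of this kind of calculation, the following Python
sketch obtains the mean and sampling variance of the average-difficulty
score by exact enumeration of the random walk. It is not the recursive
method of Brownlee, Hodges, and Rosenblatt, which is not reproduced here;
it is a brute-force substitute that is practical only for small n. The
logistic form of the item characteristic curve, the rule that difficulty
moves up by d after a right answer and down by d after a wrong one, and
the name average_difficulty_moments are all assumptions made for the
example.

```python
import math
from collections import defaultdict


def icc(theta, beta, alpha=1.0, gamma=0.0):
    """Assumed logistic item characteristic curve: P(right answer | theta)."""
    return gamma + (1.0 - gamma) / (1.0 + math.exp(-alpha * (theta - beta)))


def average_difficulty_moments(theta, n, d, beta1=0.0, alpha=1.0, gamma=0.0):
    """Exact mean and variance of the average-difficulty score for n items.

    State (k, s): k is the net number of upward steps taken so far, so the
    current difficulty is beta1 + d * k; s is the running sum of k over
    items 2 .. current, so the score is beta1 + d * s / n after n items.
    """
    dist = {(0, 0): 1.0}
    for _ in range(n):
        nxt = defaultdict(float)
        for (k, s), prob in dist.items():
            p = icc(theta, beta1 + d * k, alpha, gamma)
            nxt[(k + 1, s + k + 1)] += prob * p          # right: harder item next
            nxt[(k - 1, s + k - 1)] += prob * (1.0 - p)  # wrong: easier item next
        dist = nxt
    mean = sum(prob * (beta1 + d * s / n) for (_, s), prob in dist.items())
    var = sum(prob * (beta1 + d * s / n - mean) ** 2
              for (_, s), prob in dist.items())
    return mean, var


# Compare two hypothetical step sizes at one ability level.
for step in (0.25, 1.0):
    mean, var = average_difficulty_moments(theta=1.0, n=15, d=step)
    print(step, mean - 1.0, math.sqrt(var))  # step size, bias, standard error
```

Repeating the loop over a grid of ability levels gives the bias and
standard error of each design at each level, which is the comparison
described above.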

A variety of testing designs were investigated in this fashion by
Lord (1970b, 1971d). Numerical studies of a variety of stochastic
approximation methods applicable to individualized testing are reported by
Cochran and Davis (1964), Davis (1971), Wetherill (1963), Wetherill
and Levitt (1965), and Wetherill, Chen, and Vasudeva (1966). Other
empirical studies of individualized testing include Bayroff and Seeley
(1967), Ferguson (1971), Hansen and Schwarz (1968), Linn, Rock,
and Cleary (1969, 1972), Paterson (1962), Seeley, Morton, and Anderson
(1962), Urry (1970), Waters (1964), Waters and Bayroff (1971), and Wood
(1969).

14. Relation to Psychophysical Methods

The up-and-down method is often used in bioassay. According to Guilford
(1954), it originally was developed for the study of explosives. When used
in psychophysical studies, it is known as the staircase method (Cornsweet,
1962).

The psychophysicist does not need to know the precise mathematical form
of the psychometric function. To facilitate comparison with icc theory, let
us assume that the psychometric function actually is given by equation (1)
or (2) with $\gamma = 0$.

Whereas the mental tester controls $\alpha$ and $\beta$ (by using pretested
items) while trying to estimate the value of $\theta$, the psychophysicist (or
bioassayist) controls $\theta$ while trying to estimate the value of $\beta$ and,
sometimes, the value of $\alpha$. Note that $\theta$ and $\beta$ play reversed roles
for the mental tester and for the psychophysicist. For the latter, $\theta$
might represent the physical intensity of the various stimuli presented
under experimental control. Then, $\beta$ would be the "threshold" at which
the subject says "yes, I detect the stimulus" a proportion $F(0)$ of the
time; $\alpha$ would be the precision of the psychometric function. The
psychophysicist chooses the stimulus level $\theta_1$, administers this
stimulus, and records the response $u_1 = 0$ or 1. He then chooses another
stimulus level $\theta_2$, administers this stimulus, records $u_2 = 0$ or 1,
and continues in this way.
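
The reversed roles can be made concrete with a small Python sketch of a
staircase experiment. Here the experimenter moves the stimulus level
while the subject's threshold stays fixed and unknown. The logistic
psychometric function with $\gamma = 0$, the rule of lowering the stimulus
by d after a detection and raising it by d after a miss, and the use of
the average stimulus level as the threshold estimate are assumptions of
the example; the names detect_prob, staircase, and threshold_estimate are
hypothetical.

```python
import math
import random


def detect_prob(stimulus, threshold, precision=1.0):
    """Assumed logistic psychometric function with gamma = 0.

    stimulus plays the role of theta, threshold the role of beta,
    and precision the role of alpha.
    """
    return 1.0 / (1.0 + math.exp(-precision * (stimulus - threshold)))


def staircase(threshold, n, d, start, precision=1.0, rng=None):
    """Present n stimuli; return levels theta_1 .. theta_{n+1} and responses."""
    rng = rng or random.Random()
    levels, responses = [start], []
    for _ in range(n):
        p = detect_prob(levels[-1], threshold, precision)
        u = 1 if rng.random() < p else 0
        responses.append(u)
        # Detected: make the next stimulus weaker. Missed: make it stronger.
        levels.append(levels[-1] - d if u else levels[-1] + d)
    return levels, responses


def threshold_estimate(levels):
    """Average of the stimulus levels after the first (the analogue of the
    average-difficulty score)."""
    return sum(levels[1:]) / (len(levels) - 1)


levels, responses = staircase(threshold=3.0, n=200, d=0.25, start=0.0,
                              rng=random.Random(0))
print(threshold_estimate(levels))
```

Apart from the direction of the steps, the procedure has the same form as
the up-and-down testing procedure; what differs is which quantity is
under the experimenter's control and which is being estimated.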

In mental testing, we are interested only in the relative values of
$\theta$ for different examinees; $\theta$ is, at best, measured on an interval scale.
The unit and zero point of this scale have little ready meaning for most
other scientists concerned with mental measurement. The psychophysicist,
on the contrary, usually estimates the absolute value of $\beta$ on some standard
scale having a unit and origin well known to physicists and other scientists.

Avoiding bias in his estimates is therefore of crucial importance for 
the psychophysicist. In mental testing, any linear transformation of $\theta$ is
as valid as any other. Bias is usually of no importance to the mental tester 
as long as it affects all scores equally. 

The fact that the psychophysicist has two unknown parameters, $\alpha$ and
$\beta$, creates a further problem. It is not possible for him to choose the
step size d optimally without knowing $\alpha$. A poor choice of d leads
either to excessive standard error or bias in the estimated threshold, or 
else to experiments that are unnecessarily lengthy. 

Often the psychophysicist can obtain observations cheaply. It may be 
easy for him to obtain a thousand or ten thousand responses from a single 
subject. The mental tester cannot do this. The objective situation forces 
the mental tester to use reasonably efficient testing and estimation methods. 
For the psychophysicist, statistically efficient procedures may be unnecessary 
and distinctly uneconomical. 

In addition to the staircase method, the psychophysicist sometimes uses
block up-and-down methods (Stuckey, Hutton, and Campbell, 1966; Tsutakawa,
1963, 1967a, b; Cochran and Davis, 1964) and unequal-step-size "sequential"
methods (Taylor and Creelman, 1967; Pollack, 1968). The time-honored
constant-stimulus method corresponds in part to conventional (not individualized)
mental testing; the scoring methods are different in the two applications,
however.

The indicated correspondence between individualized testing and certain
psychophysical experiments is clear and instructive whenever the mental
tester can work with items all having the same $\alpha$ and $\gamma$ accurately
determined by pretesting. Such situations do not really exist at present,
however. Not enough items are usually available to do practical work with
a pool of items all having the same $\alpha$ and $\gamma$ (proponents of Rasch's
method may disagree).

Present work in icc theory and practice is concerned with estimating 
item and examinee parameters simultaneously. This is very different from 
the typical psychophysical problem. An outstanding current problem is how 
to carry out individualized testing using test items characterized by a 
variety of inaccurately estimated item parameters. A recent article by 
Dupač and Král (1972) is relevant for individualized testing with fallibly
estimated values of $\beta$.